Client Report - Can You Predict That?

Course DS 250

Author

HENRY FELIPE

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here

# import your data here using pandas and the URL

Elevator pitch

A Random Forest classifier labels Denver-area homes as built before or after 1980 with over 99% accuracy on held-out data, comfortably above the client's 90% target. The strongest signals beyond the build year itself are structural: architecture style, number of stories, bathrooms, and living area all shifted noticeably around 1980. A companion regression model estimates the actual build year to within about five years on average (MAE of 4.67 years), so the approach supports both classification and year estimation for the client.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

The charts show clear differences between homes built before and after 1980. Older homes tend to have lower net prices and smaller living areas, while newer homes generally have higher values in both features. The distributions show strong separation, meaning net price and living area contain useful predictive information for machine learning.

Show the code
# Include and execute your code here


# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)

# Label target variable
df["before1980_label"] = df["before1980"].map({1: "Before 1980", 0: "1980 or newer"})

# Chart 1: Boxplot of Net Price
ggplot(df, aes(x='before1980_label', y='netprice', fill='before1980_label')) + \
    geom_boxplot() + \
    scale_y_log10() + \
    labs(
        title="Net Price vs Home Age Category",
        x="Home Built",
        y="Net Price (log scale)"
    ) + \
    theme_bw()
Show the code
# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)

# Label target variable
df["before1980_label"] = df["before1980"].map({1: "Before 1980", 0: "1980 or newer"})

# --- CHART 2: Livearea Density Plot by Before1980 ----
ggplot(df, aes(x='livearea', color='before1980_label', fill='before1980_label')) + \
    geom_density(alpha=0.4) + \
    scale_x_log10() + \
    labs(
        title="Distribution of Home Size (Livearea) by Home Age Category",
        x="Live Area (square feet, log scale)",
        y="Density"
    ) + \
    theme_bw()
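As a quick numeric companion to the charts, group medians make the separation concrete. The sketch below uses a tiny hand-made frame (the values are illustrative, not from the real data); applying the same groupby to `df` yields the actual medians.

```python
import pandas as pd

# Tiny stand-in for df with the same columns used in the charts.
# (Values are illustrative only, not from the real dataset.)
toy = pd.DataFrame({
    "before1980": [1, 1, 1, 0, 0, 0],
    "netprice":   [90_000, 110_000, 95_000, 180_000, 220_000, 205_000],
    "livearea":   [900, 1100, 1000, 1900, 2400, 2100],
})

# Median net price and living area per class; a large gap between the
# two rows means the feature separates the classes well.
summary = toy.groupby("before1980")[["netprice", "livearea"]].median()
print(summary)
```

A feature whose per-class medians sit far apart (relative to its spread) is exactly the kind of signal a classifier can exploit.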

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

For this task, I tested three classification models (Logistic Regression, Decision Tree, and Random Forest) to predict whether a home was built before 1980. Logistic Regression reached 99.65% test accuracy but assumes roughly linear relationships; the Decision Tree and Random Forest both scored 100%. The perfect scores are largely explained by yrbuilt remaining in the feature set: the before1980 label is derived directly from it, so tree-based models can recover the label exactly, and removing yrbuilt (and the sale-date columns syear/smonth) would give a more honest measure of generalization. I kept the Random Forest (300 trees, unlimited depth, stratified 75/25 split) as the final model because it captures nonlinear feature interactions and, by averaging many trees, is less prone to overfitting than a single decision tree.

Show the code
# Include and execute your code here

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)

# Drop tasp to reduce target leakage
# (note: yrbuilt stays in the features below and directly determines
#  before1980, which is why the tree models score perfectly)
df = df.drop(columns=["tasp"], errors="ignore")

# Target
y = df["before1980"]

# Features
X = df.drop(columns=["before1980", "parcel"])

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

# Train/test split WITH stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# ----------------------------------------------------
# MODEL 1 — Logistic Regression (scaled data)
# ----------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_model = LogisticRegression(max_iter=3000)
log_model.fit(X_train_scaled, y_train)
log_pred = log_model.predict(X_test_scaled)
log_acc = accuracy_score(y_test, log_pred)

# ----------------------------------------------------
# MODEL 2 — Decision Tree
# ----------------------------------------------------
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)

# ----------------------------------------------------
# MODEL 3 — Random Forest (Final Choice)
# ----------------------------------------------------
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

# Display all accuracies
print("Logistic Regression Accuracy:", round(log_acc, 4))
print("Decision Tree Accuracy:", round(tree_acc, 4))
print("Random Forest Accuracy:", round(rf_acc, 4))
Logistic Regression Accuracy: 0.9965
Decision Tree Accuracy: 1.0
Random Forest Accuracy: 1.0

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

The Random Forest's importance scores are dominated by yrbuilt, which is expected: the before1980 label is defined directly from the build year. Among the remaining features, the most informative are arcstyle_ONE-STORY (architecture style), stories, numbaths, gartype_Att (attached garage), livearea, and the quality ratings (quality_B, quality_C). This matches the Task 1 charts: home size, layout, and garage and quality characteristics all shifted around 1980, so these structural features carry real predictive signal even without the build year itself.

Show the code
# Include and execute your code here


import pandas as pd
from lets_plot import *
from sklearn.ensemble import RandomForestClassifier

LetsPlot.setup_html()


X_real = df.drop(columns=["before1980", "parcel", "tasp"], errors="ignore")


features = X_real.columns.tolist()

# Train Random Forest on the full dataset (importances only, no holdout needed)
rf_filtered = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42
)
rf_filtered.fit(X_real, y)

# Build feature importance table
importance_df = pd.DataFrame({
    "feature": features,
    "importance": rf_filtered.feature_importances_
}).sort_values(by="importance", ascending=False)

print(importance_df.head(15))

# Plot Feature Importance (lets_plot compatible)
(
    ggplot(importance_df.head(15), aes(x='feature', y='importance'))
    + geom_bar(stat="identity", fill="#2C7BB6")
    + coord_flip()
    + labs(
        title="Top Feature Importances (Using Real Housing Features)",
        x="Feature",
        y="Importance Score"
    )
    + theme_bw()
)

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

Model Quality Evaluation

To evaluate my Random Forest classifier, I used accuracy, precision, and recall. The model achieved near-perfect accuracy, meaning almost all predictions matched the true labels. Its precision was extremely high, showing the model rarely labeled newer homes as “before 1980” by mistake. The recall was also very high, meaning the model successfully identified almost all homes that truly were built before 1980. Together, these metrics confirm that the classifier is both reliable and consistent across different types of prediction errors.

Show the code
# Include and execute your code here
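A sketch of how these metrics can be computed with scikit-learn. The vectors below are illustrative placeholders; in the report they would be replaced by y_test and rf_pred from the Task 2 cell (1 = built before 1980).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Placeholder vectors; swap in y_test and rf_pred from the Task 2 cell.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_hat  = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

# Accuracy: share of all predictions that are correct.
print("accuracy: ", accuracy_score(y_true, y_hat))   # 8/10 = 0.8
# Precision: of homes predicted "before 1980", how many truly are.
print("precision:", precision_score(y_true, y_hat))  # 4/5 = 0.8
# Recall: of homes truly "before 1980", how many the model caught.
print("recall:   ", recall_score(y_true, y_hat))     # 4/5 = 0.8
```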

STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

All three models performed well: Logistic Regression made 20 errors on the test set, while the Decision Tree and Random Forest were perfect. Both tree models lean heavily on yrbuilt, which directly determines the before1980 label, so their perfect scores reflect that shortcut rather than genuine generalization; the Decision Tree relies on it exclusively (importance 1.0). The Random Forest at least spreads importance across several structural features (architecture style, stories, bathrooms, living area), making it the most stable and generalizable model to recommend to the client.

Show the code
# Include and execute your code here


import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt



# train/test split 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# --------------------------------------
# 2. Define models
# --------------------------------------
log_reg = LogisticRegression(max_iter=2000)
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42)

models = {
    "Logistic Regression": log_reg,
    "Decision Tree": tree,
    "Random Forest": rf
}

# --------------------------------------
# 3. Train models and create Confusion Matrices
# --------------------------------------
for name, model in models.items():
    print(f"\n============================")
    print(f"MODEL: {name}")
    print("============================")

    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Print Confusion Matrix
    cm = confusion_matrix(y_test, preds)
    print("Confusion Matrix:\n", cm)

    # Display matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap="Blues")
    plt.title(f"{name} - Confusion Matrix")
    plt.show()

# --------------------------------------
# 4. Feature Importance (or Coefficients)
# --------------------------------------

# Logistic Regression coefficients (features are unscaled here, so
# coefficient magnitudes are only a rough proxy for importance)
log_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": log_reg.coef_[0]
}).sort_values(by="importance", ascending=False)

print("\nLogistic Regression Feature Importance:")
display(log_importance.head(10))

# Decision Tree Feature Importance
tree_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": tree.feature_importances_
}).sort_values(by="importance", ascending=False)

print("\nDecision Tree Feature Importance:")
display(tree_importance.head(10))

# Random Forest Feature Importance
rf_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)

print("\nRandom Forest Feature Importance:")
display(rf_importance.head(10))

============================
MODEL: Logistic Regression
============================
Confusion Matrix:
 [[2564   14]
 [   6 4290]]


============================
MODEL: Decision Tree
============================
Confusion Matrix:
 [[2578    0]
 [   0 4296]]


============================
MODEL: Random Forest
============================
Confusion Matrix:
 [[2578    0]
 [   0 4296]]


Logistic Regression Feature Importance:
feature importance
14 syear 3.182438
13 smonth 0.488385
22 quality_C 0.141556
35 arcstyle_MIDDLE UNIT 0.050482
45 qualified_U 0.038630
40 arcstyle_TRI-LEVEL 0.034966
8 numbdrm 0.034715
29 gartype_None 0.029138
37 arcstyle_ONE-STORY 0.027118
18 condition_Good 0.021895

Decision Tree Feature Importance:
feature importance
4 yrbuilt 1.0
0 abstrprd 0.0
2 finbsmnt 0.0
1 livearea 0.0
3 basement 0.0
5 totunits 0.0
6 stories 0.0
7 nocars 0.0
8 numbdrm 0.0
9 numbaths 0.0

Random Forest Feature Importance:
feature importance
4 yrbuilt 0.523443
37 arcstyle_ONE-STORY 0.054863
6 stories 0.048633
9 numbaths 0.043491
25 gartype_Att 0.033880
1 livearea 0.033730
22 quality_C 0.026385
0 abstrprd 0.021284
21 quality_B 0.020934
10 sprice 0.019039
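Each accuracy figure can be recovered by hand from its confusion matrix (rows are true classes, columns are predicted classes). A quick check against the Logistic Regression matrix printed above:

```python
import numpy as np

# Logistic Regression confusion matrix from the output above:
# rows = true class (0 = 1980 or newer, 1 = before 1980),
# columns = predicted class.
cm = np.array([[2564,   14],
               [   6, 4290]])

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives

accuracy = (tp + tn) / cm.sum()
print(f"accuracy = {accuracy:.4f}")  # 6854 / 6874 -> 0.9971
```

The 20 off-diagonal entries (14 + 6) are the only misclassifications, which matches the ~99.7% accuracy reported for this model.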

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.

After joining the neighborhood data on parcel, all three models stayed near-perfect: Logistic Regression made 22 errors, the Decision Tree was again exact, and the Random Forest missed a single home. As before, the tree models' perfect scores trace back to yrbuilt, which directly determines the label; the neighborhood dummies add only marginal signal (individual nbhd_* columns rank low in every importance table). The expanded dataset therefore does not change the recommendation: Random Forest remains the most stable model for the client.

Show the code
# Include and execute your code here



import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# ----------------------------------------------------------
# 1. LOAD BOTH DATASETS FROM GITHUB
# ----------------------------------------------------------
df1 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")

# Merge on parcel
df_merged = df1.merge(df2, on="parcel", how="left")

# ----------------------------------------------------------
# 2. PREPARE FEATURES AND TARGET
# ----------------------------------------------------------
X = df_merged.drop(columns=["before1980", "parcel"])
y = df_merged["before1980"]

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ----------------------------------------------------------
# 3. DEFINE MODELS (Same as Stretch Question 1)
# ----------------------------------------------------------
log_reg = LogisticRegression(max_iter=2000)
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42)

models = {
    "Logistic Regression": log_reg,
    "Decision Tree": tree,
    "Random Forest": rf
}

# ----------------------------------------------------------
# 4. TRAIN MODELS + CONFUSION MATRIX
# ----------------------------------------------------------
for name, model in models.items():
    print(f"\n============================")
    print(f"MODEL: {name}")
    print("============================")

    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Confusion matrix
    cm = confusion_matrix(y_test, preds)
    print("Confusion Matrix:\n", cm)

    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap="Blues")
    plt.title(f"{name} - Confusion Matrix")
    plt.show()



# Logistic Regression (coefficients)
log_imp = pd.DataFrame({
    "feature": X.columns,
    "importance": log_reg.coef_[0]
}).sort_values(by="importance", ascending=False)

print("\nLogistic Regression Feature Importance (Top 15):")
display(log_imp.head(15))

# Decision Tree
tree_imp = pd.DataFrame({
    "feature": X.columns,
    "importance": tree.feature_importances_
}).sort_values(by="importance", ascending=False)

print("\nDecision Tree Feature Importance (Top 15):")
display(tree_imp.head(15))

# Random Forest
rf_imp = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)

print("\nRandom Forest Feature Importance (Top 15):")
display(rf_imp.head(15))

============================
MODEL: Logistic Regression
============================
Confusion Matrix:
 [[3189    7]
 [  15 5178]]


============================
MODEL: Decision Tree
============================
Confusion Matrix:
 [[3196    0]
 [   0 5193]]


============================
MODEL: Random Forest
============================
Confusion Matrix:
 [[3195    1]
 [   0 5193]]


Logistic Regression Feature Importance (Top 15):
feature importance
15 syear 3.377513
14 smonth 0.208560
38 arcstyle_ONE-STORY 0.038318
23 quality_C 0.037784
8 numbdrm 0.030064
19 condition_Good 0.029872
29 gartype_Det 0.025867
47 status_I 0.014244
230 nbhd_624 0.011063
46 qualified_U 0.008451
37 arcstyle_ONE AND HALF-STORY 0.007799
5 totunits 0.007620
30 gartype_None 0.006177
270 nbhd_668 0.005462
41 arcstyle_TRI-LEVEL 0.003858

Decision Tree Feature Importance (Top 15):
feature importance
4 yrbuilt 1.0
1 livearea 0.0
258 nbhd_655 0.0
259 nbhd_656 0.0
260 nbhd_657 0.0
261 nbhd_658 0.0
262 nbhd_659 0.0
263 nbhd_660 0.0
264 nbhd_661 0.0
9 numbaths 0.0
266 nbhd_664 0.0
267 nbhd_665 0.0
268 nbhd_666 0.0
269 nbhd_667 0.0
270 nbhd_668 0.0

Random Forest Feature Importance (Top 15):
feature importance
4 yrbuilt 0.370957
6 stories 0.051141
38 arcstyle_ONE-STORY 0.046712
9 numbaths 0.038721
1 livearea 0.033096
26 gartype_Att 0.031679
44 arcstyle_TWO-STORY 0.028143
23 quality_C 0.025719
54 nbhd_101 0.021712
10 sprice 0.020271
0 abstrprd 0.020103
12 netprice 0.019715
13 tasp 0.019456
22 quality_B 0.017984
16 condition_AVG 0.017641

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

Yes. I framed this as a regression problem and trained a Random Forest Regressor to predict yrbuilt directly. On a 30% holdout the model reaches an MAE of 4.67 years (predictions are off by about five years on average), an RMSE of 9.19 years (RMSE squares errors before averaging, so the gap between it and the MAE indicates a handful of larger misses), and an R² of 0.94 (the model explains roughly 94% of the variance in build year). One caveat: before1980 is by far the most important feature (importance 0.69), and since it is derived from yrbuilt it tells the model which side of 1980 to predict; dropping it would yield a more honest, if less accurate, estimate.

Show the code
# Include and execute your code here

# ----------------------------------------------------------
# STRETCH QUESTION | TASK 3 - Predicting Year Built
# ----------------------------------------------------------

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Use merged dataset from previous task (or replace with df)
df = df_merged.copy()

# ----------------------------------------------------------
# 1. Prepare Features and Target
# ----------------------------------------------------------
X = df.drop(columns=["yrbuilt", "parcel"])   # features
y = df["yrbuilt"]                             # target variable

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# ----------------------------------------------------------
# 2. Define and Train Regression Model
# ----------------------------------------------------------
rf_reg = RandomForestRegressor(
    n_estimators=300,
    random_state=42
)

rf_reg.fit(X_train, y_train)

# ----------------------------------------------------------
# 3. Predictions and Evaluation Metrics
# ----------------------------------------------------------
y_pred = rf_reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("===== YEAR BUILT REGRESSION RESULTS =====")
print(f"MAE (Mean Absolute Error): {mae:.2f} years")
print(f"RMSE (Root Mean Squared Error): {rmse:.2f} years")
print(f"R² Score: {r2:.4f}")

# ----------------------------------------------------------
# 4. Feature Importance (Top 15)
# ----------------------------------------------------------
importances = pd.DataFrame({
    "feature": X.columns,
    "importance": rf_reg.feature_importances_
}).sort_values(by="importance", ascending=False)

print("\nTop 15 Important Features:")
display(importances.head(15))
===== YEAR BUILT REGRESSION RESULTS =====
MAE (Mean Absolute Error): 4.67 years
RMSE (Root Mean Squared Error): 9.19 years
R² Score: 0.9395

Top 15 Important Features:
feature importance
48 before1980 0.690041
28 gartype_Det 0.050045
3 basement 0.034363
5 stories 0.030446
25 gartype_Att 0.025057
1 livearea 0.014159
0 abstrprd 0.013484
12 tasp 0.013339
11 netprice 0.008824
2 finbsmnt 0.008728
6 nocars 0.006071
22 quality_C 0.005923
7 numbdrm 0.005827
9 sprice 0.005387
108 nbhd_230 0.004204
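The gap between the MAE (4.67 years) and RMSE (9.19 years) above deserves a note: RMSE squares each error before averaging, so a few large misses inflate it far more than they inflate MAE. A toy illustration:

```python
import numpy as np

# Two error profiles with the same total absolute error:
steady   = np.array([5.0, 5.0, 5.0, 5.0])   # every prediction off by 5 years
outliers = np.array([1.0, 1.0, 1.0, 17.0])  # mostly close, one big miss

for errs in (steady, outliers):
    mae  = np.mean(np.abs(errs))         # mean absolute error
    rmse = np.sqrt(np.mean(errs ** 2))   # root mean squared error
    print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Both profiles share an MAE of 5.00, but the single 17-year miss pushes the second RMSE well above it, which is exactly the pattern in the report's 4.67 vs. 9.19 results.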